home environment
the complete method is significantly different from prior methods ([25,37,38,41]) tackling the object goal navigation
We thank the reviewers for their valuable feedback and comments. R3 & R5 point out that parts of some modules are based on prior work. Novelty is also recognized by R1 ("clear algorithmic innovation") and R2 ("adds several new features"). All reviewers have appreciated the real-world experiments in the submission. R1 & R5 have suggested there should be more emphasis on real-world experiments.
Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments
Zhang, Zhenliang, Wang, Yuxi, Xie, Hongzhao, Zhao, Shiyun, Liu, Mingyuan, Lu, Yujie, He, Xinyi, Cheng, Zhenku, Peng, Yujia
A key feature differentiating artificial general intelligence (AGI) from traditional AI is that AGI can perform composite tasks that require a wide range of capabilities. Although embodied agents powered by multimodal large language models (MLLMs) offer rich perceptual and interactive capabilities, it remains largely unexplored whether they can solve composite tasks. In the current work, we designed a set of composite tasks inspired by common daily activities observed in early childhood development. Within a dynamic and simulated home environment, these tasks span three core domains: object understanding, spatial intelligence, and social activity. We evaluated 17 leading proprietary and open-source MLLMs on these tasks. The results consistently showed poor performance across all three domains, indicating a substantial gap between current capabilities and general intelligence requirements. Together, our tasks offer a preliminary framework for evaluating the general capabilities of embodied agents, marking an early but significant step toward the development of embodied MLLMs and their real-world deployment.
Eye Care You: Voice Guidance Application Using Social Robot for Visually Impaired People
Lin, Ting-An, Tsai, Pei-Lin, Chen, Yi-An, Chen, Feng-Yu, Chen, Lyn Chao-ling
In this study, a social robot device was designed for visually impaired users, along with a mobile application that provides functions to assist them in daily life. Both the physical and mental conditions of visually impaired users are considered, and the mobile application provides four functions: photo record, mood lift, greeting guest, and today highlight. The application was designed for visually impaired users and uses voice control to provide a friendly interface. The photo record function allows visually impaired users to capture images immediately when they encounter dangerous situations. The mood lift function accompanies visually impaired users by asking questions, playing music, and reading articles. The greeting guest function responds to visitors on behalf of visually impaired users, whose physical condition makes this inconvenient. In addition, the today highlight function reads news, including the weather forecast, daily horoscopes, and daily reminders, to visually impaired users. Multiple tools were adopted to develop the mobile application, and a website was developed for caregivers to check the status of visually impaired users and for marketing the application.
Estimating Respiratory Effort from Nocturnal Breathing Sounds for Obstructive Sleep Apnoea Screening
Xu, Xiaolei, Niu, Chaoyue, Brown, Guy J., Romero, Hector, Ma, Ning
Obstructive sleep apnoea (OSA) is a prevalent condition with significant health consequences, yet many patients remain undiagnosed due to the complexity and cost of overnight polysomnography. Acoustic-based screening provides a scalable alternative, but its performance is limited by environmental noise and the lack of physiological context. Respiratory effort is a key signal used in clinical scoring of OSA events, but current approaches require additional contact sensors that reduce scalability and patient comfort. This paper presents the first study to estimate respiratory effort directly from nocturnal audio, enabling physiological context to be recovered from sound alone. We propose a latent-space fusion framework that integrates the estimated effort embeddings with acoustic features for OSA detection. Using a dataset of 157 nights from 103 participants recorded in home environments, our respiratory effort estimator achieves a concordance correlation coefficient of 0.48, capturing meaningful respiratory dynamics. Fusing effort and audio improves sensitivity and AUC over audio-only baselines, especially at low apnoea-hypopnoea index thresholds. The proposed approach requires only smartphone audio at test time, which enables sensor-free, scalable, and longitudinal OSA monitoring.
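For readers unfamiliar with the metric quoted above, the following is a minimal sketch of Lin's concordance correlation coefficient (CCC), which penalizes both poor correlation and offsets in mean or scale. The function name and the use of population (1/n) variances are assumptions made for illustration; the abstract does not say how the authors computed their score.

```python
def concordance_cc(reference, estimate):
    """Lin's concordance correlation coefficient between two equal-length signals.

    Hypothetical helper for illustration; uses population (1/n) statistics.
    """
    n = len(reference)
    if n == 0 or n != len(estimate):
        raise ValueError("signals must be non-empty and of equal length")

    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    var_r = sum((r - mean_r) ** 2 for r in reference) / n
    var_e = sum((e - mean_e) ** 2 for e in estimate) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate)) / n

    # The squared mean difference in the denominator penalizes systematic bias.
    denom = var_r + var_e + (mean_r - mean_e) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined for two identical constant signals")
    return 2.0 * cov / denom
```

A perfectly tracked signal gives 1.0, while the same signal shifted by a constant offset scores lower even though its Pearson correlation is still 1.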
HiLWS: A Human-in-the-Loop Weak Supervision Framework for Curating Clinical and Home Video Data for Neurological Assessment
Irani, Atefeh, Mirian, Maryam S., Lassooij, Alex, Hosseini, Reshad, Moradi, Hadi, McKeown, Martin J.
Video-based assessment of motor symptoms in conditions such as Parkinson's disease (PD) offers a scalable alternative to in-clinic evaluations, but home-recorded videos introduce significant challenges, including visual degradation, inconsistent task execution, annotation noise, and domain shifts. We present HiLWS, a cascaded human-in-the-loop weak supervision framework for curating and annotating hand motor task videos from both clinical and home settings. Unlike conventional single-stage weak supervision methods, HiLWS employs a novel cascaded approach: it first applies weak supervision to aggregate expert-provided annotations into probabilistic labels, which are then used to train machine learning models. Model predictions, combined with expert input, are subsequently refined through a second stage of weak supervision. The complete pipeline includes quality filtering, optimized pose estimation, and task-specific segment extraction, complemented by context-sensitive evaluation metrics that assess both visual fidelity and clinical relevance by prioritizing ambiguous cases for expert review. Our findings reveal key failure modes in home-recorded data and emphasize the importance of context-sensitive curation strategies for robust medical video analysis.
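As a rough illustration of what "aggregating expert-provided annotations into probabilistic labels" can look like, the sketch below uses a weighted vote with abstentions. This is an assumption-laden simplification, not the HiLWS aggregation: the function name, the fixed per-annotator weights, and the uniform fallback are all hypothetical, and weak supervision label models typically estimate annotator reliability from data instead of taking it as given.

```python
def aggregate_annotations(votes, classes, weights=None):
    """Turn per-annotator votes into a probability distribution over classes.

    votes:   dict annotator -> label, or None when the annotator abstains
    classes: sequence of all possible labels
    weights: optional dict annotator -> non-negative trust weight (default 1.0)
    All names here are hypothetical and chosen for illustration only.
    """
    weights = weights or {}
    mass = {c: 0.0 for c in classes}
    for annotator, label in votes.items():
        if label is None:
            continue  # abstentions contribute no evidence
        mass[label] += weights.get(annotator, 1.0)

    total = sum(mass.values())
    if total == 0.0:
        # No usable votes: fall back to a uniform (maximally uncertain) label.
        return {c: 1.0 / len(classes) for c in classes}
    return {c: m / total for c, m in mass.items()}
```

A clip whose probabilistic label stays close to uniform is a natural candidate for the kind of expert review the abstract describes.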
OpenGuide: Assistive Object Retrieval in Indoor Spaces for Individuals with Visual Impairments
Xu, Yifan, Wang, Qianwei, Kamat, Vineet, Menassa, Carol
Indoor built environments like homes and offices often present complex and cluttered layouts that pose significant challenges for individuals who are blind or visually impaired, especially when performing tasks that involve locating and gathering multiple objects. While many existing assistive technologies focus on basic navigation or obstacle avoidance, few systems provide scalable and efficient multi-object search capabilities in real-world, partially observable settings. To address this gap, we introduce OpenGuide, an assistive mobile robot system that combines natural language understanding with vision-language foundation models (VLMs), frontier-based exploration, and a Partially Observable Markov Decision Process (POMDP) planner. OpenGuide interprets open-vocabulary requests, reasons about object-scene relationships, and adaptively navigates and localizes multiple target items in novel environments. Our approach enables robust recovery from missed detections through value decay and belief-space reasoning, resulting in more effective exploration and object localization. We validate OpenGuide in simulated and real-world experiments, demonstrating substantial improvements in task success rate and search efficiency over prior methods. This work establishes a foundation for scalable, human-centered robotic assistance in assisted living environments.
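The idea of recovering from missed detections through belief-space reasoning can be illustrated with a plain Bayesian update over candidate locations. The sketch below is not OpenGuide's planner; the function name, the location labels, and the `miss_rate` parameter (the assumed probability that the detector overlooks an object that is actually present) are hypothetical.

```python
def update_belief_after_miss(belief, searched, miss_rate=0.2):
    """Bayes update of a location belief after searching one place and seeing nothing.

    belief:    dict location -> prior probability that the target is there
    searched:  the location that was just inspected without a detection
    miss_rate: assumed chance of missing the target even though it is present
    """
    unnormalized = {}
    for location, prior in belief.items():
        # A non-detection is only informative about the place that was searched.
        likelihood = miss_rate if location == searched else 1.0
        unnormalized[location] = prior * likelihood

    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under the current belief")
    return {location: p / total for location, p in unnormalized.items()}
```

Because the searched location keeps a non-zero probability, a planner built on such a belief can return to it later instead of ruling it out after a single missed detection.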
Context-Aware Risk Estimation in Home Environments: A Probabilistic Framework for Service Robots
Ishii, Sena, Chikhalikar, Akash, Ravankar, Ankit A., Luces, Jose Victorio Salazar, Hirata, Yasuhisa
We present a novel framework for estimating accident-prone regions in everyday indoor scenes, aimed at improving real-time risk awareness in service robots operating in human-centric environments. As robots become integrated into daily life, particularly in homes, the ability to anticipate and respond to environmental hazards is crucial for ensuring user safety, trust, and effective human-robot interaction. Each object is represented as a node with an associated risk score, and risk propagates asymmetrically from high-risk to low-risk objects based on spatial proximity and accident relationship. This enables the robot to infer potential hazards even when they are not explicitly visible or labeled. Designed for interpretability and lightweight onboard deployment, our method is validated on a dataset with human-annotated risk regions, achieving a binary risk detection accuracy of 75%. The system demonstrates strong alignment with human perception, particularly in scenes involving sharp or unstable objects. These results underline the potential of context-aware risk reasoning to enhance robotic scene understanding and proactive safety behaviors in shared human-robot spaces. This framework could serve as a foundation for future systems that make context-driven safety decisions, provide real-time alerts, or autonomously assist users in avoiding or mitigating hazards within home environments. As service robots become increasingly integrated into daily life (supporting tasks such as cleaning, acting as communication companions, searching for objects, or navigating shared spaces [1]-[3]), their roles are expected to expand beyond single-function behaviors. With recent advances in embodied intelligence and large language models, these robots are beginning to understand complex instructions and act autonomously in diverse home environments.
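To make the asymmetric propagation idea concrete, here is a small sketch in which risk flows only from higher-risk objects to lower-risk ones, attenuated by distance and by a pairwise relationship weight. The exponential distance falloff, the `length_scale` parameter, the `relation` weights, and the one-pass max rule are all assumptions for illustration; the abstract does not specify the actual propagation rule.

```python
import math


def propagate_risk(objects, relation, length_scale=1.0):
    """One pass of asymmetric risk propagation over a set of scene objects.

    objects:      dict name -> (x, y, base_risk), with base_risk in [0, 1]
    relation:     dict (source, target) -> accident-relationship weight in [0, 1]
    length_scale: assumed distance (same units as x, y) over which influence decays
    """
    updated = {}
    for target, (xt, yt, risk_t) in objects.items():
        best = risk_t
        for source, (xs, ys, risk_s) in objects.items():
            if source == target or risk_s <= risk_t:
                continue  # asymmetric: only higher-risk objects raise a neighbor
            proximity = math.exp(-math.hypot(xs - xt, ys - yt) / length_scale)
            weight = relation.get((source, target), 0.0)
            # Move the target's risk toward the source's, scaled by both factors.
            best = max(best, risk_t + (risk_s - risk_t) * proximity * weight)
        updated[target] = best
    return updated
```

In this sketch an apple lying next to a knife inherits most of the knife's risk, while a cup across the room is left essentially unchanged and the knife's own score is never lowered.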
Environment Modeling for Service Robots From a Task Execution Perspective
Zhang, Ying, Tian, Guohui, Zhang, Cui-Hua, Hua, Changchun, Ding, Weili, Ahn, Choon Ki
Service robots are increasingly entering the home to perform domestic tasks for residents. However, when working in an open, dynamic, and unstructured home environment, service robots still face challenges such as low intelligence for task execution and poor long-term autonomy (LTA), which have limited their deployment. As the basis of robotic task execution, environment modeling has attracted significant attention. It integrates core technologies such as environment perception, understanding, and representation to accurately recognize environmental information. This paper presents a comprehensive survey of environmental modeling from a new task-execution-oriented perspective. In particular, guided by the requirements of robots in performing domestic service tasks in the home environment, we systematically review the progress that has been made in task-execution-oriented environmental modeling in four respects: 1) localization, 2) navigation, 3) manipulation, and 4) LTA. Current challenges are discussed, and potential research opportunities are also highlighted.
Sound Tagging in Infant-centric Home Soundscapes
Khan, Mohammad Nur Hossain, Li, Jialu, McElwain, Nancy L., Hasegawa-Johnson, Mark, Islam, Bashima
Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively.
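The two metrics reported above can be computed as in the sketch below, which shows a single-class F1-score and Cohen's kappa for one label sequence. This is only an illustration of the definitions: the function names are hypothetical, and the abstract does not state how scores were averaged across sound classes.

```python
from collections import Counter


def f1_binary(y_true, y_pred, positive=1):
    """F1-score for one positive class (harmonic mean of precision and recall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def cohens_kappa(y_true, y_pred):
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both labelings pick the same class independently.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one class ever occurs
    return (observed - expected) / (1.0 - expected)
```

Kappa is useful alongside F1 on imbalanced data like this because it discounts the agreement a model would reach by chance, for example by always predicting the majority class.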